Distributed Data Placement via Graph Partitioning
Authors
Abstract
With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, relational workloads cannot always be evaluated efficiently using MapReduce without extensive data migrations, which cause network congestion and reduced query throughput. We study the problem of computing data placement strategies that minimize the data communication costs incurred by typical relational query workloads in a distributed setting. Our main contribution is a reduction of the data placement problem to the well-studied problem of GRAPH PARTITIONING, which is NP-hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and in using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for objectives other than communication cost. We study several practical extensions of the problem: with load balancing, with replication, with materialized views, and with complex query plans consisting of sequences of intermediate operations that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute near-optimal solutions. For the versions with replication, we introduce two heuristics that build on the GRAPH PARTITIONING solution of the no-replication case. On the TPC-DS workload, an IP solver may take weeks to compute an optimal data placement, whereas our reduction produces near-optimal solutions in seconds.
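To make the graph partitioning objective concrete, here is a minimal Python sketch of a toy placement instance. It is not the paper's reduction, which the abstract does not spell out. It assumes that each node is a data item, that each edge weight is the communication cost paid when the two endpoints land on different servers, and that load balancing means equal-sized halves across two servers. The table names, the weights, and the helper names `cut_weight` and `best_balanced_bisection` are all hypothetical.

```python
from itertools import combinations


def cut_weight(edges, placement):
    """Sum the weights of edges whose endpoints sit on different servers."""
    return sum(w for (u, v), w in edges.items() if placement[u] != placement[v])


def best_balanced_bisection(nodes, edges):
    """Exhaustively search balanced 2-way placements (tiny inputs only)."""
    ordered = sorted(nodes)
    half = len(ordered) // 2
    best_cost, best_placement = None, None
    for group in combinations(ordered, half):
        on_server_0 = set(group)
        placement = {n: 0 if n in on_server_0 else 1 for n in ordered}
        cost = cut_weight(edges, placement)
        if best_cost is None or cost < best_cost:
            best_cost, best_placement = cost, placement
    return best_cost, best_placement


# Hypothetical workload: weight = data shipped if the two tables are split.
nodes = {"orders", "lineitem", "customer", "nation"}
edges = {
    ("orders", "lineitem"): 10,
    ("orders", "customer"): 3,
    ("customer", "nation"): 8,
    ("lineitem", "nation"): 1,
}

cost, placement = best_balanced_bisection(nodes, edges)
print(cost, placement)
```

On this toy input the cheapest balanced placement keeps `orders` with `lineitem` and `customer` with `nation`, cutting only the two light edges for a total cost of 4. Exhaustive search grows exponentially with the number of items, which is why a heuristic partitioner such as METIS would be used on realistic workloads.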
Similar resources
Graph Partitioning via Parallel Submodular Approximation to Accelerate Distributed Machine Learning
Distributed computing excels at processing large scale data, but the communication cost for synchronizing the shared parameters may slow down the overall performance. Fortunately, the interactions between parameter and data in many problems are sparse, which admits efficient partition in order to reduce the communication overhead. In this paper, we formulate data placement as a graph partitioni...
A Scalable Distributed Graph Partitioner
We present Scalable Host-tree Embeddings for Efficient Partitioning (Sheep), a distributed graph partitioning algorithm capable of handling graphs that far exceed main memory. Sheep produces high quality edge partitions an order of magnitude faster than both state of the art offline (e.g., METIS) and streaming partitioners (e.g., Fennel). Sheep’s partitions are independent of the input graph di...
Distributed Graph-Partitioning based Coalition Formation for Collaborative Multi-Agent Systems: Some Lessons Learned and Challenges Ahead
We study algorithms for distributed collaborative multi-agent coalition formation. The focus of our recent and ongoing research has been on coalition formation via scalable distributed graph partitioning of the underlying agents’ communication network topology. In that endeavor, we have been analyzing, simulating and optimizing our original graph partitioning algorithm called Maximal Clique bas...
Scalable Linked Data Stream Processing via Network-Aware Workload Scheduling
In order to cope with the ever-increasing data volume, distributed stream processing systems have been proposed. To ensure scalability most distributed systems partition the data and distribute the workload among multiple machines. This approach does, however, raise the question how the data and the workload should be partitioned and distributed. A uniform scheduling strategy—a uniform distribu...
Optimal Placement and Sizing of Distributed Generation Via an Improved Nondominated Sorting Genetic Algorithm II
The use of distributed generation units in distribution networks has attracted the attention of network managers due to its great benefits. In this research, the location and determination of the capacity of distributed generation (DG) units for different purposes has been studied simultaneously. The multi-objective functions in the optimization model are reducing system line losses; reducing v...
Journal title:
- CoRR
Volume: abs/1312.0285
Publication date: 2013